Abstract. In 1971, Knuth gave an $O(n^2)$-time algorithm for the classic
problem of finding an optimal binary search tree. Knuth's algorithm
works only for search trees based on 3-way comparisons, but most modern
computers support only 2-way comparisons ($<$, $\le$, $=$, $\ge$, and $>$).
Until this paper, the problem of finding an optimal search tree using 2-way
comparisons remained open: poly-time algorithms were known only for
restricted variants. We solve the general case, giving (i) an $O(n^4)$-time
algorithm and (ii) an $O(n \log n)$-time additive-3 approximation algorithm.
For finding optimal binary split trees, we (iii) obtain a linear speedup
and (iv) prove some previous work incorrect.


1 Background and statement of results 

In 1971, Knuth [29 gave an $O(n^2)$-time dynamic-programming algorithm for a
classic problem: given a set $\mathcal{K}$ of keys and a probability
distribution on queries, find an optimal binary-search tree $T$. As shown in
Fig. 1, a search in such a tree for a given value $v$ compares $v$ to the root
key, then (i) recurses left if $v$ is smaller, (ii) stops if $v$ equals the
key, or (iii) recurses right if $v$ is larger, halting at a leaf.
The comparisons made in the search must suffice to determine the relation of
$v$ to all keys in $\mathcal{K}$. (Hence, $T$ must have $2|\mathcal{K}| + 1$
leaves.) $T$ is optimal if it has minimum cost, defined as the expected number
of comparisons assuming the query $v$ is chosen randomly from the specified
probability distribution.
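
For orientation, the 3-way-comparison problem can be solved by a textbook
dynamic program. The sketch below is the plain cubic-time version, not
Knuth's quadratic-time refinement. The function name `optimal_bst_cost` and
the list encoding (`beta[i-1]` for key $K_i$, `alpha[i]` for the $i$-th gap
between keys) are illustrative assumptions, and the cost counts one 3-way
comparison per internal node visited.

```python
def optimal_bst_cost(beta, alpha):
    """Minimum expected number of 3-way comparisons (cubic-time sketch).

    beta[i-1] is the probability of querying key K_i (i = 1..n);
    alpha[i] is the probability of a query in the i-th gap (i = 0..n).
    """
    n = len(beta)
    # cost[i][j], weight[i][j] refer to keys K_{i+1}..K_j and gaps i..j.
    cost = [[0.0] * (n + 1) for _ in range(n + 1)]
    weight = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        weight[i][i] = alpha[i]          # a lone gap is a leaf of cost 0
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            weight[i][j] = weight[i][j - 1] + beta[j - 1] + alpha[j]
            # Try every key K_r as the root of the subtree.
            cost[i][j] = weight[i][j] + min(
                cost[i][r - 1] + cost[r][j] for r in range(i + 1, j + 1)
            )
    return cost[0][n]
```

With a single key, every query makes exactly one comparison, so
`optimal_bst_cost([0.5], [0.25, 0.25])` evaluates to 1.0.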

Knuth assumed three-way comparisons at each node. With the rise of
higher-level programming languages, most computers began supporting only
two-way comparisons ($<$, $\le$, $=$, $\ge$, $>$). In the 2nd edition of
Volume 3 of The Art of Computer Programming E §6.2.2 ex. 33], Knuth commented

... machines that cannot make three-way comparisons at once ... will have to
make two comparisons ... it may well be best to have a binary tree whose
internal nodes specify either an equality test or a less-than test but not both.

But Knuth gave no algorithm to find a tree built from two-way comparisons (a
2wcst, as in Fig. 2(a)), and, prior to the current paper, poly-time algorithms
were known only for restricted variants. Most notably, in 2002 Anderson et
al. [1] gave an $O(n^4)$-time algorithm for the successful-queries variant of
2WCST, in which each query $v$ must be a key in $\mathcal{K}$, so only
$|\mathcal{K}|$ leaves are needed (Fig. 2(b)). The standard problem allows
arbitrary queries, so $2|\mathcal{K}| + 1$ leaves are needed (Fig. 2(a)). For
the standard problem, no polynomial-time algorithm was previously known. We
give one for a more general problem that we call 2WCST:

Fig. 2. Two 2wcsts for $\mathcal{K} = \{H, O, W\}$; tree (b) only handles successful queries.


Theorem 1. 2WCST has an $O(n^4)$-time algorithm.

We specify an instance $\mathcal{I}$ of 2WCST as a tuple
$\mathcal{I} = (\mathcal{K} = \{K_1, \ldots, K_n\}, \mathcal{Q}, \mathcal{C}, \alpha, \beta)$.
The set $\mathcal{C}$ of allowed comparison operators can be any subset of
$\{<, \le, =, \ge, >\}$. The set $\mathcal{Q}$ specifies the queries. A
solution is an optimal 2WCST $T$ among those using operators in $\mathcal{C}$
and handling all queries in $\mathcal{Q}$. This definition generalizes both
standard 2WCST (let $\mathcal{Q}$ contain each key and a value between each
pair of keys), and the successful-queries variant (take
$\mathcal{Q} = \mathcal{K}$ and $\alpha \equiv 0$). It further allows any
query set $\mathcal{Q}$ between these two extremes, even allowing
$\mathcal{K} \not\subseteq \mathcal{Q}$. As usual, $\beta_i$ is the
probability that $v$ equals $K_i$; $\alpha_i$ is the probability that $v$
falls between keys $K_i$ and $K_{i+1}$ (except $\alpha_0 = \Pr[v < K_1]$ and
$\alpha_n = \Pr[v > K_n]$) (see footnote 4).

To prove Thm. 1 we prove Spuler's 1994 "maximum-likelihood" conjecture: in
any optimal 2WCST tree, each equality comparison is to a key in $\mathcal{K}$
of maximum likelihood, given the comparisons so far m §6.4 Conj. 1]. As
Spuler observed, the conjecture implies an $O(n^5)$-time algorithm; we reduce
this to $O(n^4)$ using standard techniques and a new perturbation argument.
Anderson et al. proved the conjecture for their special case [1, Cor. 3]. We
were unable to extend their proof directly; our proof uses a different
local-exchange argument.

4 As defined here, a 2wcst $T$ must determine the relation of the query $v$
to every key in $\mathcal{K}$. More generally, one could specify any
partition $\mathcal{P}$ of $\mathcal{Q}$, and only require $T$ to determine,
if at all possible using keys in $\mathcal{K}$, which set
$S \in \mathcal{P}$ contains $v$. For example, if
$\mathcal{P} = \{\mathcal{K}, \mathcal{Q} \setminus \mathcal{K}\}$, then $T$
would only need to determine whether $v \in \mathcal{K}$. We note without
proof that Thm. 1 extends to this more general formulation.

We also give a fast additive-3 approximation algorithm: 

Theorem 2. Given any instance
$\mathcal{I} = (\mathcal{K}, \mathcal{Q}, \mathcal{C}, \alpha, \beta)$ of
2WCST, one can compute a tree of cost at most the optimum plus 3, in
$O(n \log n)$ time.

Comparable results were known for the successful-queries variant
($\mathcal{Q} = \mathcal{K}$) |16ll| .
We approximately reduce the general case to that case.


Binary split trees "split" each 3-way comparison in Knuth's 3-way-comparison
model into two 2-way comparisons within the same node: an equality
comparison (which, by definition, must be to the maximum-likelihood key) and
a "<" comparison (to any key) [13131811215] . The fastest algorithms to find
an optimal binary split tree take $O(n^5)$ time: from 1984 for the
successful-queries-only variant ($\mathcal{Q} = \mathcal{K}$) [5]; from 1986
for the standard problem ($\mathcal{Q}$ contains queries in all possible
relations to the keys in $\mathcal{K}$) [6]. We obtain a linear speedup:


Theorem 3. Given any instance
$\mathcal{I} = (\mathcal{K} = \{K_1, \ldots, K_n\}, \alpha, \beta)$ of the
standard binary-split-tree problem, an optimal tree can be computed in
$O(n^4)$ time.


The proof uses our new perturbation argument (Sec. 3.1) to reduce to the case
when all $\beta_i$'s are distinct, then applies a known algorithm [5]. The
perturbation argument can also be used to simplify Anderson et al.'s
algorithm [1].


Generalized binary split trees (gbsts) are binary split trees without the
maximum-likelihood constraint. Huang and Wong [5] (1984) observe that relaxing
this constraint allows cheaper trees (the maximum-likelihood conjecture fails
here) and propose an algorithm to find optimal GBSTs. We prove it incorrect!

Theorem 4. Lemma 4 of Jiff is incorrect: there exists an instance (a query
distribution $\beta$) for which it does not hold, and on which their
algorithm fails.

This flaw also invalidates two algorithms, proposed in Spuler’s thesis 113, that 
are based on Huang and Wong’s algorithm. We know of no poly-time algorithm 
to find optimal gbsts. Of course, optimal 2wcsts are at least as good. 


2wcst without equality tests. Finding an optimal alphabetical encoding has
several poly-time algorithms: by Gilbert and Moore ($O(n^3)$ time, 1959) |5j;
by Hu and Tucker ($O(n \log n)$ time, 1971) [7]; and by Garsia and Wachs
($O(n \log n)$ time but simpler, 1979) [1|. The problem is equivalent to
finding an optimal 3-way-comparison search tree when the probability of
querying any key is zero ($\beta \equiv 0$) [Tl) §6.2.2]. It is also
equivalent to finding an optimal 2wcst in the successful-queries variant with
only "<" comparisons allowed ($\mathcal{C} = \{<\}$,
$\mathcal{Q} = \mathcal{K}$) [H §5.2]. We generalize this observation to
prove Thm. 5:

Theorem 5. Any 2WCST instance
$\mathcal{I} = (\mathcal{K} = \{K_1, \ldots, K_n\}, \mathcal{Q}, \mathcal{C}, \alpha, \beta)$
where "=" is not in $\mathcal{C}$ (equality tests are not allowed) can be
solved in $O(n \log n)$ time.
Definitions 1 Fix an arbitrary instance
$\mathcal{I} = (\mathcal{K}, \mathcal{Q}, \mathcal{C}, \alpha, \beta)$.

For any node $N$ in any 2WCST $T$ for $\mathcal{I}$, $N$'s query subset,
$\mathcal{Q}_N$, contains queries $v \in \mathcal{Q}$ such that the search for
$v$ reaches $N$. The weight $\omega(N)$ of $N$ is the probability that a
random query $v$ (from distribution $(\alpha, \beta)$) is in $\mathcal{Q}_N$.
The weight $\omega(T')$ of any subtree $T'$ of $T$ is $\omega(N)$ where $N$ is
the root of $T'$.

Let $\langle v < K_i \rangle$ denote an internal node having key $K_i$ and
comparison operator $<$ (define $\langle v \le K_i \rangle$ and
$\langle v = K_i \rangle$ similarly). Let $\langle K_i \rangle$ denote the
leaf $N$ such that $\mathcal{Q}_N = \{K_i\}$. Abusing notation, $\omega(K_i)$
is a synonym for $\omega(\langle K_i \rangle)$, that is, $\beta_i$.

Say $T$ is irreducible if for every node $N$ with parent $N'$,
$\mathcal{Q}_N \subsetneq \mathcal{Q}_{N'}$.

In the remainder of the paper, we assume that only comparisons in
$\{<, \le, =\}$ are allowed (i.e., $\mathcal{C} \subseteq \{<, \le, =\}$).
This is without loss of generality, as "$v > K_i$" and "$v \ge K_i$" can be
replaced, respectively, by "$v \le K_i$" and "$v < K_i$."


2 Proof of Spuler’s conjecture 


Fix any irreducible, optimal 2WCST $T$ for any instance
$\mathcal{I} = (\mathcal{K}, \mathcal{Q}, \mathcal{C}, \alpha, \beta)$.

Theorem 6 (Spuler's conjecture). The key $K_a$ in any equality-comparison
node $N = \langle v = K_a \rangle$ is a maximum-likelihood key:
$\beta_a = \max_i \{\beta_i : K_i \in \mathcal{Q}_N\}$.

The theorem will follow easily from Lemma 1.

Lemma 1. Let internal node $\langle v = K_a \rangle$ be the ancestor of
internal node $\langle v = K_z \rangle$. Then $\omega(K_a) \ge \omega(K_z)$.
That is, $\beta_a \ge \beta_z$.

Proof. (Lemma 1) Throughout, "$\langle v \prec K_i \rangle$" denotes a node in
$T$ that does an inequality comparison ($\le$ or $<$, not $=$) to key $K_i$.
Abusing notation, in that context, "$x \prec K_i$" (or "$x \not\prec K_i$")
denotes that $x$ passes (or fails) that comparison.

Assumption 1 (i) All nodes on the path from $\langle v = K_a \rangle$ to
$\langle v = K_z \rangle$ do inequality comparisons. (ii) Along the path, some
other node $\langle v \prec K_s \rangle$ separates key $K_a$ from $K_z$:
either $K_a \prec K_s$ but $K_z \not\prec K_s$, or $K_z \prec K_s$ but
$K_a \not\prec K_s$.


It suffices to prove the lemma assuming (i) and (ii) above. (Indeed, if the
lemma holds given (i), then, by transitivity, the lemma holds in general.
Given (i), if (ii) doesn't hold, then exchanging the two nodes preserves
correctness, changing the cost by $(\omega(K_a) - \omega(K_z)) \times d$ for
some $d \ge 1$; since $T$ is optimal this change is nonnegative, so
$\omega(K_a) \ge \omega(K_z)$ and we are done.)

By Assumption 1, the subtree rooted at $\langle v = K_a \rangle$, call it
$T'$, is as in Fig. 3(a): Let child $\langle v \prec K_b \rangle$, with
subtrees $T_0$ and $T_1$, be as in Fig. 3.

Lemma 2. If $K_a \prec K_b$, then $\omega(K_a) \ge \omega(T_1)$, else
$\omega(K_a) \ge \omega(T_0)$.


(This and subsequent lemmas in this section are proved in Appendix 7.2. The
idea behind this one is that correctness is preserved by replacing $T'$ by
subtree (b) if $K_a \prec K_b$ or (c) otherwise, implying the lemma by the
optimality of $T$.)


Case 1: Child $\langle v \prec K_b \rangle$ separates $K_a$ from $K_z$. If
$K_a \prec K_b$, then $K_z \not\prec K_b$, so descendant
$\langle v = K_z \rangle$ is in $T_1$, and, by this and Lemma 2,
$\omega(K_a) \ge \omega(T_1) \ge \omega(K_z)$, and we're done. Otherwise
$K_a \not\prec K_b$, so $K_z \prec K_b$, so descendant
$\langle v = K_z \rangle$ is in $T_0$, and, by this and Lemma 2,
$\omega(K_a) \ge \omega(T_0) \ge \omega(K_z)$, and we're done.

Fig. 3. (a) The subtree $T'$ rooted at $\langle v = K_a \rangle$ and possible replacements (b), (c).

Case 2: Child $\langle v \prec K_b \rangle$ does not separate $K_a$ from
$K_z$. Assume also that descendant $\langle v = K_z \rangle$ is in $T_1$. (If
descendant $\langle v = K_z \rangle$ is in $T_0$, the proof is symmetric,
exchanging the roles of $T_0$ and $T_1$.) Since descendant
$\langle v = K_z \rangle$ is in $T_1$, and child $\langle v \prec K_b \rangle$
does not separate $K_a$ from $K_z$, we have $K_a \not\prec K_b$ and two facts:

Fact A: $\omega(K_a) \ge \omega(T_0)$ (by Lemma 2), and

Fact B: the root of $T_1$ does an inequality comparison (by Assumption 1).

By Fact B, subtree $T'$ rooted at $\langle v = K_a \rangle$ is as in Fig. 4(a):



Fig. 4. (a) The subtree T' in Case 2, two possible replacements (b), (c). 


As in Fig. 4(a), let the root of $T_1$ be $\langle v \prec K_c \rangle$, with
subtrees $T_{10}$ and $T_{11}$.

Lemma 3. (i) $\omega(T_0) \ge \omega(T_{11})$. (ii) If $K_a \not\prec K_c$,
then $\omega(K_a) \ge \omega(T_1)$.

(As replacing $T'$ by (b) or (c) preserves correctness; proof in Appendix 7.2.)

Case 2.1: $K_a \not\prec K_c$. By Lemma 3(ii), $\omega(K_a) \ge \omega(T_1)$.
Descendant $\langle v = K_z \rangle$ is in $T_1$, so
$\omega(T_1) \ge \omega(K_z)$. Transitively, $\omega(K_a) \ge \omega(K_z)$,
and we are done.

Case 2.2: $K_a \prec K_c$. By Lemma 3(i), $\omega(T_0) \ge \omega(T_{11})$.
By Fact A, $\omega(K_a) \ge \omega(T_{11})$.

If $\langle v = K_z \rangle$ is in $T_{11}$, then
$\omega(T_{11}) \ge \omega(K_z)$ and transitively we are done.

In the remaining case, $\langle v = K_z \rangle$ is in $T_{10}$. $T$'s
irreducibility implies $K_z \prec K_c$. Since $K_a \prec K_c$ also (Case 2.2),
grandchild $\langle v \prec K_c \rangle$ does not separate $K_a$ from $K_z$,
and by Assumption 1 the root of subtree $T_{10}$ does an inequality
comparison. Hence, the subtree rooted at $\langle v \prec K_b \rangle$ is as
in Fig. 5(a):

Fig. 5. (a) The subtree rooted at $\langle v \prec K_b \rangle$ in Case 2.2. (b) A possible replacement.

Lemma 4. $\omega(T_0) \ge \omega(T_{10})$.

(Because replacing (a) by (b) preserves correctness; proof in Appendix 7.2.)
Since descendant $\langle v = K_z \rangle$ is in $T_{10}$, Lemma 4 implies
$\omega(T_0) \ge \omega(T_{10}) \ge \omega(K_z)$. This and Fact A imply
$\omega(K_a) \ge \omega(K_z)$. This proves Lemma 1. □

Proposition 1. If any leaf node $\langle K_\ell \rangle$'s parent $P$ does not
do an equality comparison against key $K_\ell$, then changing $P$ so that it
does so gives an irreducible 2wcst $T'$ of the same cost.

Proof. Since $\mathcal{Q}_{\langle K_\ell \rangle} = \{K_\ell\}$ and $P$'s
comparison operator is in $\mathcal{C} \subseteq \{<, \le, =\}$, it must be
that $K_\ell = \max \mathcal{Q}_P$ or $K_\ell = \min \mathcal{Q}_P$. So
changing $P$ to $\langle v = K_\ell \rangle$ (with $\langle K_\ell \rangle$ as
the "yes" child and the other child the "no" child) maintains correctness,
cost, and irreducibility. □

Proof. (Thm. 6) Consider any equality-testing node
$N = \langle v = K_a \rangle$ and any key $K_z \in \mathcal{Q}_N$. Since
$K_z \in \mathcal{Q}_N$, node $N$ has descendant leaf $\langle K_z \rangle$.
Without loss of generality (by Proposition 1), leaf $\langle K_z \rangle$'s
parent is $\langle v = K_z \rangle$. That parent is a descendant of
$\langle v = K_a \rangle$, so $\omega(K_a) \ge \omega(K_z)$ by Lemma 1. □


3 Proofs of Thm. 1 (algorithm for 2wcst) and Thm. 3

First we prove Thm. 1. Fix an instance
$\mathcal{I} = (\mathcal{K}, \mathcal{Q}, \mathcal{C}, \alpha, \beta)$. Assume
for now that all probabilities in $\beta$ are distinct. For any query subset
$\mathcal{S} \subseteq \mathcal{Q}$, let $\mathrm{opt}(\mathcal{S})$ denote
the minimum cost of any 2WCST that correctly determines all queries in subset
$\mathcal{S}$ (using keys in $\mathcal{K}$, comparisons in $\mathcal{C}$, and
weights from the appropriate restriction of $\alpha$ and $\beta$ to
$\mathcal{S}$). Let $\omega(\mathcal{S})$ be the probability that a random
query $v$ is in $\mathcal{S}$. The cost of any tree for $\mathcal{S}$ is the
weight of the root ($= \omega(\mathcal{S})$) plus the cost of its two
subtrees, yielding the following dynamic-programming recurrence:

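Lemma 5. For any query set $\mathcal{S} \subseteq \mathcal{Q}$ not handled by
a single-node tree,

\[
\mathrm{opt}(\mathcal{S}) = \omega(\mathcal{S}) + \min
\begin{cases}
\min_{k} \ \mathrm{opt}(\mathcal{S} \setminus \{k\}) \quad \text{(if ``$=$'' is in $\mathcal{C}$, else $\infty$)} & \text{(i)} \\
\min_{k, \prec} \ \mathrm{opt}(\mathcal{S}_k^{\prec}) + \mathrm{opt}(\mathcal{S} \setminus \mathcal{S}_k^{\prec}), & \text{(ii)}
\end{cases}
\]

where $k$ ranges over $\mathcal{K}$, and $\prec$ ranges over the allowed
inequality operators (if any), and
$\mathcal{S}_k^{\prec} = \{v \in \mathcal{S} : v \prec k\}$.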
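
The recurrence of Lemma 5 can be evaluated directly on tiny instances, which
is handy for sanity checks. The following is a minimal sketch under stated
assumptions, not the paper's algorithm: the name `naive_opt` and the
dictionary encoding of the query distribution are hypothetical, each query is
assumed to be in its own equivalence class (so a set is handled by a single
node exactly when it has at most one query), and equality tests are tried
only against keys still in the set, since any other equality test makes no
progress.

```python
import math
import operator
from functools import lru_cache

def naive_opt(keys, weight, ops=("<", "<=", "=")):
    """Evaluate the recurrence by brute force (exponential time).

    keys   -- the key values
    weight -- dict mapping each query value to its probability
    ops    -- allowed comparison operators, a subset of {"<", "<=", "="}
    Returns math.inf if no correct tree exists.
    """
    keys = tuple(sorted(keys))
    tests = {"<": operator.lt, "<=": operator.le}

    @lru_cache(maxsize=None)
    def opt(S):
        if len(S) <= 1:                      # handled by a single leaf
            return 0.0
        best = math.inf
        for k in keys:
            if "=" in ops and k in S:        # line (i): equality test on k
                best = min(best, opt(S - {k}))
            for op in ops:                   # line (ii): inequality tests
                if op == "=":
                    continue
                yes = frozenset(v for v in S if tests[op](v, k))
                if yes and yes != S:         # skip tests that split nothing
                    best = min(best, opt(yes) + opt(S - yes))
        return sum(weight[v] for v in S) + best

    return opt(frozenset(weight))
```

For one key `1` and queries `{0: 0.25, 1: 0.5, 2: 0.25}`, the best tree tests
equality first and then one inequality, for a cost of 1.5.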









Using the recurrence naively to compute $\mathrm{opt}(\mathcal{Q})$ yields
exponentially many query subsets $\mathcal{S}$, because of line (i). But, by
Thm. 6, we can restrict $k$ in line (i) to be the maximum-likelihood key in
$\mathcal{S}$. With this restriction, the only subsets $\mathcal{S}$ that
arise are intervals within $\mathcal{Q}$, minus some most-likely keys.
Formally, for each of $O(n^2)$ key pairs
$\{k_1, k_2\} \subseteq \mathcal{K} \cup \{-\infty, \infty\}$ with
$k_1 < k_2$, define four key intervals

$(k_1, k_2) = \{v \in \mathcal{Q} : k_1 < v < k_2\}$,   $[k_1, k_2] = \{v \in \mathcal{Q} : k_1 \le v \le k_2\}$,

$(k_1, k_2] = \{v \in \mathcal{Q} : k_1 < v \le k_2\}$,   $[k_1, k_2) = \{v \in \mathcal{Q} : k_1 \le v < k_2\}$.

For each of these $O(n^2)$ key intervals $I$, and each integer $h \le n$,
define $\mathrm{top}(I, h)$ to contain the $h$ keys in $I$ with the $h$
largest $\beta_i$'s. Define
$\mathcal{S}(I, h) = I \setminus \mathrm{top}(I, h)$.
Applying the restricted recurrence to $\mathcal{S}(I, h)$ gives a simpler
recurrence:

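Lemma 6. If $\mathcal{S}(I, h)$ is not handled by a one-node tree, then
$\mathrm{opt}(\mathcal{S}(I, h))$ equals

\[
\omega(\mathcal{S}(I, h)) + \min
\begin{cases}
\mathrm{opt}(\mathcal{S}(I, h + 1)) \quad \text{(if equality is in $\mathcal{C}$, else $\infty$)} & \text{(i)} \\
\min_{k, \prec} \ \mathrm{opt}(\mathcal{S}(I_k^{\prec}, h_k^{\prec})) + \mathrm{opt}(\mathcal{S}(I \setminus I_k^{\prec}, h - h_k^{\prec})), & \text{(ii)}
\end{cases}
\]

where key interval $I_k^{\prec} = \{v \in I : v \prec k\}$, and
$h_k^{\prec} = |\mathrm{top}(I, h) \cap I_k^{\prec}|$.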

Now, to compute $\mathrm{opt}(\mathcal{Q})$, each query subset that arises is
of the form $\mathcal{S}(I, h)$ where $I$ is a key interval and
$0 \le h \le n$. With care, each of these $O(n^3)$ subproblems can be solved
in $O(n)$ time, giving an $O(n^4)$-time algorithm. In particular, represent
each key interval $I$ by its two endpoints. For each key interval $I$ and
integer $h \le n$, precompute $\omega(\mathcal{S}(I, h))$, and
$\mathrm{top}(I, h)$, and the $h$'th largest key in $I$. Given these $O(n^3)$
values (computed in $O(n^3 \log n)$ time), the recurrence for
$\mathrm{opt}(\mathcal{S}(I, h))$ can be evaluated in $O(n)$ time. In
particular, for line (ii), one can enumerate all $O(n)$ pairs
$(k, h_k^{\prec})$ in $O(n)$ time total, and, for each, compute $I_k^{\prec}$
and $I \setminus I_k^{\prec}$ in $O(1)$ time. Each base case can be
recognized and handled (by a cost-0 leaf) in $O(1)$ time, giving total time
$O(n^4)$. This proves Thm. 1 when all probabilities in $\beta$ are distinct;
Sec. 3.1 finishes the proof.


3.1 Perturbation argument; proofs of Theorems 1 and 3

Here we show that, without loss of generality, in looking for an optimal
search tree, one can assume that the key probabilities (the $\beta_i$'s) are
all distinct. Given any instance
$\mathcal{I} = (\mathcal{K}, \mathcal{Q}, \mathcal{C}, \alpha, \beta)$,
construct instance
$\mathcal{I}' = (\mathcal{K}, \mathcal{Q}, \mathcal{C}, \alpha, \beta')$,
where $\beta'_j = \beta_j + j\varepsilon$ and $\varepsilon$ is a positive
infinitesimal (or $\varepsilon$ can be understood as a sufficiently small
positive rational). To compute (and compare) costs of trees with respect to
$\mathcal{I}'$, maintain the infinitesimal part of each value separately and
extend linear arithmetic component-wise in the natural way:

1. Compute $z \times (x_1 + x_2\varepsilon)$ as $(z x_1) + (z x_2)\varepsilon$, where $z, x_1, x_2$ are any rationals,

2. compute $(x_1 + x_2\varepsilon) + (y_1 + y_2\varepsilon)$ as $(x_1 + y_1) + (x_2 + y_2)\varepsilon$,

3. and say $x_1 + x_2\varepsilon < y_1 + y_2\varepsilon$ iff $x_1 < y_1$, or $x_1 = y_1 \wedge x_2 < y_2$.
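
The three rules above amount to arithmetic on pairs with lexicographic
comparison. As a minimal sketch (the class name `Perturbed` and the helper
`perturb` are illustrative assumptions, and exact rationals are used to avoid
rounding), they could be coded as:

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True, order=True)
class Perturbed:
    """A value x1 + x2*eps. Field order makes comparison lexicographic (rule 3)."""
    x1: Fraction
    x2: Fraction = Fraction(0)

    def __add__(self, other):        # rule 2: add component-wise
        return Perturbed(self.x1 + other.x1, self.x2 + other.x2)

    def __rmul__(self, z):           # rule 1: z * (x1 + x2*eps), z rational
        return Perturbed(z * self.x1, z * self.x2)

def perturb(beta):
    """Return the list beta'_j = beta_j + j*eps for j = 1..n."""
    return [Perturbed(Fraction(b), Fraction(j))
            for j, b in enumerate(beta, start=1)]
```

For instance, `perturb([Fraction(1, 4), Fraction(1, 4)])` yields two values
that compare as distinct even though the original probabilities tie.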

Lemma 7. In the instance $\mathcal{I}'$, all key probabilities $\beta'_j$ are
distinct. If a tree $T$ is optimal w.r.t. $\mathcal{I}'$, then it is also
optimal with respect to $\mathcal{I}$.
Proof. Let $A$ be a tree that is optimal w.r.t. $\mathcal{I}'$. Let $B$ be any
other tree, and let the costs of $A$ and $B$ under $\mathcal{I}'$ be,
respectively, $a_1 + a_2\varepsilon$ and $b_1 + b_2\varepsilon$. Then their
respective costs under $\mathcal{I}$ are $a_1$ and $b_1$. Since $A$ has
minimum cost under $\mathcal{I}'$,
$a_1 + a_2\varepsilon \le b_1 + b_2\varepsilon$. That is, either $a_1 < b_1$,
or $a_1 = b_1$ (and $a_2 \le b_2$). Hence $a_1 \le b_1$; that is, $A$ costs no
more than $B$ w.r.t. $\mathcal{I}$. Hence $A$ is optimal w.r.t.
$\mathcal{I}$. □

Doing arithmetic this way increases running time by a constant factor (see
footnote 5). This completes the proof of Thm. 1. The reduction can also be
used to avoid the significant effort that Anderson et al. [1] devote to
non-distinct key probabilities.

For computing optimal binary split trees for unrestricted queries, the fastest
known time is $O(n^5)$, due to [6]. But [6] also gives an $O(n^4)$-time
algorithm for the case of distinct key probabilities. With the above
reduction, the latter algorithm gives $O(n^4)$ time for the general case,
proving Thm. 3.

4 Proof of Thm. 2 (additive-3 approximation algorithm)

Fix any instance
$\mathcal{I} = (\mathcal{K}, \mathcal{Q}, \mathcal{C}, \alpha, \beta)$. If
$\mathcal{C}$ is $\{=\}$ then the optimal tree can be found in $O(n \log n)$
time, so assume otherwise. In particular, $<$ and/or $\le$ are in
$\mathcal{C}$. Assume that $<$ is in $\mathcal{C}$ (the other case is
symmetric).

The entropy
$H_{\mathcal{I}} = -\sum_i \beta_i \log_2 \beta_i - \sum_i \alpha_i \log_2 \alpha_i$
is a lower bound on $\mathrm{opt}(\mathcal{I})$.

For the case $\mathcal{K} = \mathcal{Q}$ and $\mathcal{C} = \{<\}$, Yeung's
$O(n)$-time algorithm [18] constructs a 2wcst that uses only $<$-comparisons
whose cost is at most $H_{\mathcal{I}} + 2 - \beta_1 - \beta_n$.
We reduce the general case to that one, adding roughly one extra comparison.

Construct
$\mathcal{I}' = (\mathcal{K}' = \mathcal{K}, \mathcal{Q}' = \mathcal{K}, \mathcal{C}' = \{<\}, \alpha', \beta')$
where each $\alpha'_i = 0$ and each $\beta'_i = \beta_i + \alpha_i$ (except
$\beta'_1 = \alpha_0 + \beta_1 + \alpha_1$). Use Yeung's algorithm [18] to
construct tree $T'$ for $\mathcal{I}'$. Tree $T'$ uses only the $<$ operator,
so any query $v \in \mathcal{Q}$ that reaches a leaf $\langle K_i \rangle$ in
$T'$ must satisfy $K_i \le v < K_{i+1}$ (or $v < K_2$ if $i = 1$).
To distinguish $K_i = v$ from $K_i < v < K_{i+1}$, we need only add one
additional comparison at each leaf (except, if $i = 1$, we need two) (see
footnote 6). By Yeung's guarantee, $T'$ costs at most
$H_{\mathcal{I}'} + 2 - \beta'_1 - \beta'_n$. The modifications can be done so
as to increase the cost by at most $1 + \alpha_0 + \alpha_1$, so the final
tree costs at most $H_{\mathcal{I}'} + 3$ (using
$\beta'_1 \ge \alpha_0 + \alpha_1$). By standard properties of entropy,
$H_{\mathcal{I}'} \le H_{\mathcal{I}} \le \mathrm{opt}(\mathcal{I})$, proving
Thm. 2.
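
The construction of $\beta'$ and the entropy comparison in this proof can be
illustrated numerically. The sketch below is only an illustration: the
function names `entropy` and `merge_gaps` and the list encoding
(`alpha = [alpha_0, ..., alpha_n]`, `beta = [beta_1, ..., beta_n]`) are
assumptions, and Yeung's algorithm itself is not implemented here.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, with 0*log(0) treated as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def merge_gaps(alpha, beta):
    """Return beta' for the reduced instance.

    beta'_i = beta_i + alpha_i, except beta'_1 = alpha_0 + beta_1 + alpha_1.
    alpha = [alpha_0, ..., alpha_n]; beta = [beta_1, ..., beta_n], n >= 1.
    """
    merged = [b + a for b, a in zip(beta, alpha[1:])]
    merged[0] += alpha[0]
    return merged

alpha = [0.125, 0.125, 0.25]
beta = [0.25, 0.25]
beta_prime = merge_gaps(alpha, beta)
# Merging outcomes can only lower the entropy of the distribution.
assert entropy(beta_prime) <= entropy(alpha + beta)
```

Here merging each gap into the key to its left coarsens the outcome
distribution, which is why the entropy of the reduced instance is no larger.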

5 Proof of Thm. 4 (errors in work on binary split trees)

A generalized binary split tree (gbst) is a rooted binary tree where each node
$N$ has an equality key $e_N$ and a split key $s_N$. A search for query
$v \in \mathcal{Q}$ starts at the root $r$. If $v = e_r$, the search halts.
Otherwise, the search recurses on the left subtree (if $v < s_r$) or the right
subtree (if $v \ge s_r$). The cost of the tree is the expected number of nodes
(including, by convention, leaves) visited for a random query $v$. Fig. 6
shows two GBSTs for a single instance.
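
To make the search rule and the cost measure concrete, here is a minimal
sketch for the successful-queries case. The `Node` class and the function
names are hypothetical, and frequencies are used as unnormalized weights;
queries equal to a split key go right, matching the rule above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    eq: str                          # equality key e_N
    split: Optional[str] = None      # split key s_N (unused at leaves)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def nodes_visited(root, v):
    """Number of nodes visited by a search for v (v must be an equality key)."""
    node, count = root, 0
    while node is not None:
        count += 1
        if v == node.eq:
            return count
        if node.split is None:       # leaf that does not hold v
            break
        node = node.left if v < node.split else node.right
    raise KeyError(v)

def gbst_cost(root, freq):
    """Frequency-weighted number of nodes visited over all keys in freq."""
    return sum(f * nodes_visited(root, v) for v, f in freq.items())

freq = {"A": 1, "B": 3, "C": 2}
tree = Node("B", split="B", left=Node("A"), right=Node("C"))
assert gbst_cost(tree, freq) == 9    # 3*1 + 1*2 + 2*2
```

Note that the equality key and the split key of a node are independent: the
tree `Node("C", split="B", left=Node("A"), right=Node("B"))` is also a valid
gbst for the same keys, at the higher cost 10.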

5 For an algorithm that works with linear (or $O(1)$-degree polynomial) functions of $\beta$.

6 If it is possible to distinguish $v = K_i$ from $K_i < v < K_{i+1}$, then
$\mathcal{C}$ must have at least one operator other than $<$, so we can add
either $\langle v = K_i \rangle$ or $\langle v \le K_i \rangle$.
Fig. 6. Two gbsts for an instance. Keys are ordered alphabetically (A0 < A1 < A2 <
A3 < B0 < ···). Each node shows its equality key and the frequency of that key; split
keys are not shown. The algorithm of [9] gives (a), of cost 1763, but (b) costs 1762.


To prove Thm. 4 we observe that [5]'s Lemma 4 and algorithm fail on the
instance in Fig. 6. There is a solution of cost only 1762 (in Fig. 6(b)), but
the algorithm gives cost 1763 for the instance (as in Fig. 6(a)), as can be
verified by executing the Python code for the algorithm in Appendix 7.1. The
intuition is that the optimal substructure property fails for the subproblems
defined by [S]: the circled subtree in (a) (with root A2) is cheaper than the
corresponding subtree in (b), but leads to larger global cost. For more
intuition and the full proof, see Appendix 7.3.

6 Proof of Thm. 5 ($O(n \log n)$ time without equality)

Fix any 2WCST instance
$\mathcal{I} = (\mathcal{K}, \mathcal{Q}, \mathcal{C}, \alpha, \beta)$ with
$\mathcal{C} \subseteq \{<, \le\}$. Let $n = |\mathcal{K}|$. We show that, in
$O(n \log n)$ time, one can compute an equivalent instance
$\mathcal{I}' = (\mathcal{K}', \mathcal{Q}', \mathcal{C}', \alpha', \beta')$
with $\mathcal{K}' = \mathcal{Q}'$, $\mathcal{C}' = \{<\}$, and
$|\mathcal{K}'| \le 2n + 1$. (Equivalent means that, given an optimal 2WCST
$T'$ for $\mathcal{I}'$, one can compute in $O(n \log n)$ time an optimal
2WCST $T$ for $\mathcal{I}$.) The idea is that, when
$\mathcal{C} \subseteq \{<, \le\}$, open intervals are functionally
equivalent to keys.

Assume without loss of generality that $\mathcal{C} = \{<, \le\}$. (Otherwise
no correct tree exists unless $\mathcal{K} = \mathcal{Q}$, and we are done.)
Assume without loss of generality that no two elements in $\mathcal{Q}$ are
equivalent (in that they relate to all keys in $\mathcal{K}$ in the same way;
otherwise, remove all but one query from each equivalence class). Hence, at
most one query lies between any two consecutive keys, and
$|\mathcal{Q}| \le 2|\mathcal{K}| + 1$.

Let instance
$\mathcal{I}' = (\mathcal{K}', \mathcal{Q}', \mathcal{C}', \alpha', \beta')$
be obtained by taking the query set $\mathcal{Q}$ as the key set
($\mathcal{K}' = \mathcal{Q}' = \mathcal{Q}$), but restricting comparisons to
$\mathcal{C}' = \{<\}$, and adjusting the probability distribution
appropriately: take $\alpha' \equiv 0$, and take $\beta'_i$ to be the
probability associated with the $i$th query (the appropriate $\alpha_j$ or
$\beta_j$).
Given any irreducible 2WCST $T$ for $\mathcal{I}$, one can construct a tree
$T'$ for $\mathcal{I}'$ of the same cost as follows. Replace each node
$\langle v \le k \rangle$ with a node $\langle v < q \rangle$, where $q$ is
the least query value larger than $k$ (there must be one, since
$\langle v \le k \rangle$ is in $T$ and $T$ is irreducible). Likewise, replace
each node $\langle v < k \rangle$ with a node $\langle v < q \rangle$, where
$q$ is the least query value greater than or equal to $k$ (there must be one,
since $\langle v < k \rangle$ is in $T$ and $T$ is irreducible). $T'$ is
correct because $T$ is.

Conversely, given any irreducible 2WCST $T'$ for $\mathcal{I}'$, one can
construct an equivalent 2WCST $T$ for $\mathcal{I}$ as follows. Replace each
node $N' = \langle v < q \rangle$ as follows. If $q \in \mathcal{K}$, replace
$N'$ by $\langle v < k \rangle$ with $k = q$. Otherwise, replace $N'$ by
$\langle v \le k \rangle$, where key $k$ is the largest key less than $q$.
(There must be such a key $k$. Node $\langle v < q \rangle$ is in $T'$ but
$T'$ is irreducible, so there is a query, and hence a key $k$, smaller than
$q$.) Since $T'$ correctly classifies each query in $\mathcal{Q}$, so does
$T$.

To finish, we note that the instance $\mathcal{I}'$ can be computed from
$\mathcal{I}$ in $O(n \log n)$ time (by sorting the keys, under reasonable
assumptions about $\mathcal{Q}$), and the second mapping (from $T'$ to $T$)
can be computed in $O(n \log n)$ time. Since $\mathcal{I}'$ has
$\mathcal{K}' = \mathcal{Q}'$ and $\mathcal{C}' = \{<\}$, it is known jUT to
be equivalent to an instance of alphabetic encoding, which can be solved in
$O(n \log n)$ time pTHj .
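
The two node mappings used in this section are simple lookups in sorted
lists. The following sketch illustrates them with binary search; the function
names and the tuple encoding of a comparison as `(op, key)` are illustrative
assumptions, and `queries` is assumed to hold one sorted representative per
equivalence class.

```python
from bisect import bisect_left, bisect_right

def to_query_comparison(op, k, queries):
    """Map a node (v op k), op in {"<", "<="}, to q such that (v < q) is
    equivalent on all of `queries` (a sorted list).

    Raises IndexError if no such q exists, which cannot happen for a node
    of an irreducible tree.
    """
    if op == "<=":
        return queries[bisect_right(queries, k)]   # least query larger than k
    return queries[bisect_left(queries, k)]        # least query >= k

def to_key_comparison(q, keys):
    """Map a node (v < q) back to a comparison (op, k) against a key in the
    sorted list `keys`."""
    i = bisect_left(keys, q)
    if i < len(keys) and keys[i] == q:
        return ("<", q)                            # q is itself a key
    if i == 0:
        raise ValueError("no key smaller than q")  # excluded by irreducibility
    return ("<=", keys[i - 1])                     # largest key less than q
```

For keys `[10, 20]` and queries `[5, 10, 15, 20, 25]`, the node
$\langle v \le 10 \rangle$ maps to $\langle v < 15 \rangle$, and mapping back
recovers `("<=", 10)`.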
7 Appendix

7.1 Python code for Thm. 4 (gbst algorithm of [9])


#!/usr/bin/env python3.4
import functools

memoize = functools.lru_cache(maxsize=None)

def huang1984(weights):
    "Returns cost as computed by Huang and Wong's GBST algorithm (1984)."
    n = len(weights)
    beta = {i+1 : weights[key] for i, key in enumerate(sorted(weights.keys()))}

    def is_legal(i, j, d): return 0 <= i <= j <= n and 0 <= d <= j-i

    @memoize
    def p_w_t(i, j, d):
        "Returns triple: (cost p[i,j,d], weight w[i,j,d], deleted keys for t[i,j,d])."
        interval = set(range(i+1, j+1))
        if d == j-i:  # base case
            return (0, 0, interval)

        def candidates():  # Lemma 4 recurrence from Huang et al
            for k in interval:  # k = index of split key
                for m in range(d+2):  # m = num. deletions from left subtree
                    if is_legal(i, k-1, m) and is_legal(k-1, j, d-m+1):
                        cost_l, weight_l, deleted_l = p_w_t(i, k-1, m)
                        cost_r, weight_r, deleted_r = p_w_t(k-1, j, d-m+1)
                        deleted = deleted_l.union(deleted_r)
                        x = min(deleted, key = lambda h : beta[h])
                        weight = beta[x] + weight_l + weight_r
                        cost = weight + cost_l + cost_r
                        yield cost, weight, deleted - set([x])

        return min(candidates())

    cost, weight, keys = p_w_t(0, n, 0)
    return cost

weights = dict(b4=20,
               a3=20, v3=20,
               a2=20, p2=20, t2=20, x2=20,
               a1=20, d1=22, n1=20, q1=20, s1=20, u1=20, w1=20, y1=20,
               b0=10, c0= 5, d0=10, e0=10, n0=10, p0=10, q0=10, r0=10,
               s0=10, t0=10, u0=10, v0=10, w0=10, x0=10, y0=10, z0=10)

assert huang1984(weights) == 1763  # Both assertions pass. The first is used in our Thm. 4.

weights['d1'] += 0.99              # Increasing a weight cannot decrease the optimal cost, but
assert huang1984(weights) < 1763   # in this case decreases the cost computed by the algorithm.

The extended abstract [2] is essentially the body of this paper minus the
remainder of this appendix. The remainder of this appendix contains all proofs
omitted from the extended abstract.
7.2 Proof of Lemmas 2-4 (in the proof of Spuler's conjecture)

We prove some slightly stronger lemmas that imply Lemmas 2-4.

Let $T$ be any irreducible, optimal 2WCST as in the proof of Lemma 1.

Lemma 8 (implies Lemma 2). Assume $T$ has a subtree as in Fig. 3(a) with
nodes $\langle v = K_a \rangle$ and $\langle v \prec K_b \rangle$. (i)
Replacing that subtree by the one in Fig. 3(b) (if $K_a \prec K_b$) or the one
in Fig. 3(c) (if $K_a \not\prec K_b$) preserves correctness. (ii) If
$K_a \prec K_b$, then $\omega(K_a) \ge \omega(T_1)$; otherwise
$\omega(K_a) \ge \omega(T_0)$.

Proof. Assume that $K_a \prec K_b$ (the other case is symmetric). By
inspection of each case ($Q = K_a$ or $Q \ne K_a$), subtree (b) classifies
each query $Q$ the same way subtree (a) does, so the modified tree is correct.
The modification changes the cost by $\omega(K_a) - \omega(T_1)$, so (since
$T$ has minimum cost) $\omega(K_a) \ge \omega(T_1)$. □



Fig. 7. Lemma 9: "Rotating" subtree (a) yields (c); the subtrees are interchangeable.


Lemma 9 (implies Lemma 3(i)). (i) If $T$ has either of the two subtrees in
Fig. 7(a) or (c), then exchanging one for the other preserves correctness.
(ii) If $T$ has the subtree in Fig. 7(a), then
$\omega(T_0) \ge \omega(T_{11})$.

Proof. Part (i). The transformation from (a) to (c) is a standard rotation
operation on binary search trees, but, since the comparison operators can be
either $<$ or $\le$ in our context, we verify correctness carefully.

By inspection, replacing subtree (a) by subtree (b) (in Fig. 7) gives a tree
that classifies all queries as $T$ does, and so is correct.

Next we observe that, in subtree (b), replacing the right subtree by just
$T_{11}$ (to obtain subtree (c)), maintains correctness. Indeed, since $T$ is
irreducible, replacing (in (a)) the subtree $T_1$ by just $T_{11}$ would give
an incorrect tree. Equivalently,
$\exists Q.\ Q \not\prec K_b \wedge Q \prec K_c$. Equivalently, the
right-bounded interval $\{Q \in \mathbb{R} : Q \prec K_c\}$ overlaps the
left-bounded interval $\{Q \in \mathbb{R} : Q \not\prec K_b\}$.
Equivalently, the complements of these intervals, namely
$\{Q \in \mathbb{R} : Q \not\prec K_c\}$ and
$\{Q \in \mathbb{R} : Q \prec K_b\}$, are disjoint. Equivalently,
$\forall Q.\ Q \not\prec K_c \rightarrow Q \not\prec K_b$. Hence, replacing
the right subtree of (b) by $T_{11}$ (yielding (c)) maintains correctness.

In sum, replacing subtree (a) by subtree (c) maintains correctness. This shows
part (i) for that direction. This replacement changes the cost by
$\omega(T_0) - \omega(T_{11})$, so (since $T$ has minimum cost)
$\omega(T_0) \ge \omega(T_{11})$. This proves part (ii). The converse
direction of part (i), replacing (c) by (a), is symmetric. □
Fig. 8. Lemma 10: Subtrees (a) and (c) are interchangeable if $K_a \not\prec K_c$.


Lemma 10 (implies Lemma 3(ii)). If $T$ has a subtree as in Fig. 8(a), and
$K_a \not\prec K_c$, then (i) replacing the subtree by the one in Fig. 8(c)
preserves correctness, and (ii) $\omega(K_a) \ge \omega(T_1)$.

Proof. (i) Assume $T$ has the subtree in Fig. 8(a) (the other case is
symmetric). By Lemma 9(i) (applied to the subtree of (a) with root
$\langle v \prec K_b \rangle$), replacing subtree (a) by subtree (b) gives a
correct tree. Then, by Lemma 8(i) (applied to subtree (b), but note that node
$\langle v \prec K_c \rangle$ in (b) takes the role of node
$\langle v \prec K_b \rangle$ in Fig. 3(a)!) replacing (b) by (c) gives a
correct tree. This proves part (i). Part (ii) follows because replacing (a) by
(c) changes the cost of $T$ by $\omega(K_a) - \omega(T_1)$, and $T$ has
minimum cost, so $\omega(K_a) \ge \omega(T_1)$. □



Lemma 11 (implies Lemma 4). If $T$ has a subtree as in Fig. 9(a), then (i)
replacing that subtree by the one in Fig. 9(c) preserves correctness, and (ii)
$\omega(T_0) \ge \omega(T_{10})$.

Proof. Applying Lemma 9(i) to the subtree of (a) with root
$\langle v \prec K_c \rangle$, replacing subtree (a) by subtree (b) gives a
correct tree (see footnote 7). Then, applying Lemma 9(i) to the subtree of (b)
with root $\langle v \prec K_b \rangle$, replacing subtree (b) by subtree (c)
gives a correct tree. This shows part (i). Part (ii) follows, because
replacing (a) by (c) changes the cost of $T$ by
$\omega(T_0) - \omega(T_{10})$, so $\omega(T_0) \ge \omega(T_{10})$. □

7 Technically, to apply Lemma 9(i), we need (b) to be correct and irreducible.
The overall argument remains valid though as long as the tree from Fig. 9(a)
is optimal.


7.3 Proof of Thm. 4: Huang and Wong's error

A generalized binary split tree (gbst) is a rooted binary tree where each node
$N$ has an equality key $e_N$ and a split key $s_N$. A search for query
$v \in \mathcal{Q}$ starts at the root $r$. If $v = e_r$, the search halts.
Otherwise, the search recurses on the left subtree (if $v < s_r$) or the right
subtree (if $v \ge s_r$). The cost of the tree is the expected number of nodes
(including, by convention, leaves) visited for a random query $v$.



Fig. 10. In a gbst, each node has equality key, frequency, and (if internal) split key. 


Huang and Wong demonstrate that equality keys in optimal GBSTs do not have the
maximum-likelihood property |S]. Fig. 10 shows their counterexample: in the
optimal GBST (a), the root equality key is E (frequency 20), not B (frequency
22). The cheapest tree with B at the root is (b), and is more expensive.
Having B at the root increases the cost because then the other two
high-frequency keys E and F have to be the children, so the split key of the
root has to split E and F, and low-frequency keys A, C, and D all must be in
the left subtree.

Following [5], restrict to successful queries
($\mathcal{K} = \mathcal{Q}$). Fix any instance
$\mathcal{I} = (\mathcal{K}, \beta)$. For any query interval
$I = \{K_i, K_{i+1}, \ldots, K_j\}$ and any subset $D \subseteq I$ of
"deleted" keys, let $\mathrm{opt}(I, D)$ denote the minimum cost of any GBST
that handles the keys in $I \setminus D$. This recurrence follows directly
from the definition of GBSTs:

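Lemma 12. For any query set $I \setminus D$ not handled by a single-node tree,

\[
\mathrm{opt}(I, D) = \omega(I \setminus D) + \min_{e, s \in \mathcal{K}}
\Bigl( \mathrm{opt}(I_{<s},\, D_e \cap I_{<s}) + \mathrm{opt}(I_{\ge s},\, D_e \cap I_{\ge s}) \Bigr),
\]

where $D_e = D \cup \{e\}$, and $I_{<s} = \{v \in I : v < s\}$ and
$I_{\ge s} = \{v \in I : v \ge s\}$.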
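
For very small instances the recurrence can be checked by brute force. The
sketch below is an illustration under stated assumptions rather than a method
from the literature: the name `gbst_opt` is hypothetical, the state is the set
of remaining keys (the query set $I \setminus D$), the equality key is chosen
among the remaining keys, keys equal to the split key go to the right
subtree, and frequencies serve as unnormalized weights. It takes exponential
time.

```python
from functools import lru_cache

def gbst_opt(freq):
    """Minimum GBST cost for successful queries, by exhaustive search.

    freq maps each key to its frequency. Each node visited by a query adds
    that query's frequency to the cost.
    """
    keys = tuple(sorted(freq))

    @lru_cache(maxsize=None)
    def opt(S):
        if not S:
            return 0
        weight = sum(freq[v] for v in S)
        best = None
        for e in S:                      # equality key at the root
            rest = S - {e}
            for s in keys:               # split key at the root
                left = frozenset(v for v in rest if v < s)
                cand = opt(left) + opt(rest - left)
                if best is None or cand < best:
                    best = cand
        return weight + best

    return opt(frozenset(keys))
```

For three keys of equal frequency 1, the best tree has one key at the root
and the other two as its children, so `gbst_opt({"A": 1, "B": 1, "C": 1})`
evaluates to 5.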

The goal is to compute $\mathrm{opt}(\mathcal{K}, \emptyset)$. Using the
recurrence above, exponentially many subsets $D$ arise. This motivates the
following lemma. For any node $N$ in an optimal GBST, define $N$'s key
interval, $I_N$, and deleted-key set, $D_N$, according to the recurrence in
the natural way. Then the set $\mathcal{Q}_N$ of queries reaching $N$ is
$I_N \setminus D_N$, and $D_N$ contains those keys in $I_N$ that are in
equality tests above $N$, and $I_N$ contains the key values that, if searched
for in $T$ with the equality tests removed, would reach node $N$.
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Lemma 13 (|9j Lemma 2]). For any node N in an optimal GBST, N ’s equality 
key is a least-frequent key among those in In that aren’t equality keys in any of 
N’s subtrees: if e^ = Kj, then Pi = min{/3 j : Kj £ By}- 
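
The sets in this lemma can be computed by one top-down pass over a given tree, which also gives a way to test the lemma's condition. The sketch below is an illustration under stated assumptions, not the authors' code: the `Node` class and function names are hypothetical, queries equal to the split key go right, and the keys of the interval that are not equality keys in the subtrees below N are taken to be the deleted keys of N together with N's own equality key. The condition is necessary for optimality, so a tree that fails it cannot be optimal, but passing it does not prove optimality.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Node:
    eq_key: str
    split_key: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def annotate(root: Node, keys: List[str]) -> List[Tuple[Node, Set[str], Set[str]]]:
    """Return (N, I_N, D_N) for every node N, following the recurrence top-down."""
    out: List[Tuple[Node, Set[str], Set[str]]] = []

    def walk(node: Optional[Node], interval: Set[str], deleted: Set[str]) -> None:
        if node is None:
            return
        out.append((node, interval, deleted))
        if node.split_key is None:
            return
        d_e = deleted | {node.eq_key}                       # keys tested at or above N
        lo = {k for k in interval if k < node.split_key}    # I_{<s}
        hi = interval - lo                                  # I_{>=s}
        walk(node.left, lo, d_e & lo)
        walk(node.right, hi, d_e & hi)

    walk(root, set(keys), set())
    return out


def satisfies_lemma_13(root: Node, freq: Dict[str, float]) -> bool:
    """Check that each equality key is least frequent among D_N plus itself."""
    return all(
        freq[n.eq_key] <= min(freq[k] for k in d | {n.eq_key})
        for n, _, d in annotate(root, list(freq))
    )


freq = {"A": 1, "B": 3, "C": 2}
good = Node("B", "C", Node("A"), Node("C"))  # heavy key B tested first
bad = Node("A", "C", Node("B"), Node("C"))   # light key A tested above heavier B
print(satisfies_lemma_13(good, freq), satisfies_lemma_13(bad, freq))  # True False
```

In the `bad` tree, the node holding B has A in its deleted-key set, and A is less frequent than B, so swapping the two equality keys would lower the cost.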

The proof is the same exchange argument that shows our Assumption JTJli) . 

[9] claims (incorrectly) that, by Lemma 13, the desired value $\mathrm{opt}(\mathcal{Q}, \emptyset)$ can
be computed as follows. For any key interval $I = \{K_{i+1}, \ldots, K_j\}$ and $d < n$, let

$$p[i, j, d] = \min\{\mathrm{opt}(I, D) : D \subseteq I,\ |D| = d\} \qquad (1)$$

be the minimum cost of any GBST for any query set $I \setminus D$ consisting of $I$ minus
any $d$ deleted keys. Let $t[i, j, d]$ be a corresponding subtree of cost $p[i, j, d]$, and
let $w[i, j, d]$ be the weight of the root of $t[i, j, d]$.
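
To make definition (1) concrete, the sketch below evaluates a single entry p[i, j, d] by enumerating every deletion set of size d and taking the cheapest resulting subtree. It is only the definition computed by brute force, not the algorithm of Huang and Wong. The function name `p_entry` is hypothetical, keys K_1..K_n are stored at indices 0..n-1 so that the interval K_{i+1}..K_j becomes `range(i, j)`, and the base cases (an empty query set costs 0, a single query costs its weight) are assumptions.

```python
from functools import lru_cache
from itertools import combinations
from typing import FrozenSet, Sequence


def p_entry(weights: Sequence[float], i: int, j: int, d: int) -> float:
    """p[i, j, d]: cheapest GBST over K_{i+1..j} minus some d deleted keys."""
    w = tuple(weights)

    @lru_cache(maxsize=None)
    def opt(lo: int, hi: int, deleted: FrozenSet[int]) -> float:
        queries = [k for k in range(lo, hi) if k not in deleted]
        if len(queries) <= 1:
            return float(sum(w[k] for k in queries))  # assumed base cases
        best = float("inf")
        for e in queries:                 # equality key at the root
            d_e = deleted | {e}
            for s in range(lo, hi + 1):   # split into [lo, s) and [s, hi)
                left = opt(lo, s, frozenset(k for k in d_e if k < s))
                right = opt(s, hi, frozenset(k for k in d_e if k >= s))
                best = min(best, left + right)
        return sum(w[k] for k in queries) + best

    # Minimize over all deletion sets D of size d; infinity if none exists.
    return min(
        (opt(i, j, frozenset(D)) for D in combinations(range(i, j), d)),
        default=float("inf"),
    )


print(p_entry((5, 1, 1), 0, 3, 1))  # deleting the weight-5 key is cheapest: 3.0
```

The minimization looks only at the cost of the subtree itself, so it favors deleting heavy keys regardless of where those keys end up higher in the tree.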

Their algorithm uses the following (incorrect) recurrence (their Lemma 4):

“$p[i, j, d] = \min(w[i, j, d] + p[i, k-1, m] + p[k-1, j, d-m-1])$

where the minimum is taken over all legal combinations of k’s and m’s [and]
$w[i, j, d] = w[i, k-1, m] + w[k-1, j, d-m-1] + \beta_x$
where x is the index of the key of minimum frequency among those in range
$\{K_{i+1}, \ldots, K_j\}$ but outside $t[i, k-1, m]$ and $t[k-1, j, d-m-1]$...”

Next we describe their error. Recall that p[i,j,d] chooses a subtree of minimum
cost (among trees with any d keys deleted). But this choice might not lead
to minimum overall cost! The reason is that the subtree’s cost does not suffice
to determine the contribution to the overall cost: the weight of the subtree, and
the weights of the deleted keys and their eventual locations, also matter.




Fig. 11. Trees T, T' for 9-key interval I with d = 2. (Split keys not shown.) 


For example, consider p[i,j,d] for the subproblem where d = 2 and interval
I consists of the nine keys I = {A1, A2, A3, B0, B4, C0, D0, D1, E0}, with the
following weights:

key:     A1  A2  A3  B0  B4  C0  D0  D1  E0
weight:  20  20  20  10  20   5  10  22  10


Fig. 11 shows two 7-node subtrees (circled and shaded), called T and T', involving
these keys. These subtrees will be used in our counterexample, described
below. (The split key of each node is not shown in the diagram.)

Partition the set of possible trees t[i,j,d] into two classes: (i) those that
contain D1 and (ii) those that don’t (that is, D1 is a “deleted” key). By a careful
case analysis (see footnote 8), the subtree T in Fig. 11(a) is a cheapest (although
not unique) tree in class (i), while the 7-node subtree T' in Fig. 11(b) is a cheapest
tree in class (ii). Further, the subtree T' costs 1 more than the subtree T. Hence,
the algorithm of [5] will choose T, not T', for this subproblem.

However, this choice is incorrect. Consider not just the cost of the tree, but also
the effects of the choice on the deleted keys’ costs. For definiteness, suppose the
two deleted nodes become, respectively, the parent and grandparent of the root
of the subtree, as in (a) and (b) of the figure. In (b), C0 is one level deeper than
it is in (a), which increases the cost by 5, but D1 is three levels higher, which
decreases the overall cost by 2 × 3 = 6, for a net decrease of 1 unit. Hence, using
T instead of T' ends up costing the overall solution 1 unit more.

This observation is the basis of the complete counterexample shown in Fig. 6.
The counterexample extends the smaller example above by appending two “neutral”
subintervals, with 7 and 15 keys, respectively, each of which (without any
deletions) admits a self-contained balanced tree. Keys are ordered alphabetically.
On this instance, the algorithm of [9] (and their Lemma 4) fail, as they choose
T instead of T' for the subproblem. Fig. 6(a) shows the tree computed by their
recurrence, of cost 1763. (This can be verified by executing the Python code for
the algorithm in Appendix 7.1.) Fig. 6(b) shows a tree that costs 1 less. This
proves Thm. 4.


Spuler’s thesis. In addition to Spuler’s conjecture (and 2WCST algorithms that
rely on his conjecture), Spuler’s thesis [15, §6.4.1 Conj. 1] also presents code
for two additional algorithms that he claims, without proof, compute optimal
2wcsts independently of his conjecture.

First, [15, §6.4.1] gives code for the problem restricted to successful queries
($\mathcal{Q} = \mathcal{K}$), which it describes as a “straightforward” modification of Huang et
al.’s algorithm [9] for generalized split trees (an algorithm that we now know is
incorrect, per our Thm. 4). Correctness is addressed only by the remark that
“The changes to the optimal generalized binary split tree algorithm of Huang and
Wong [9] to produce optimal generalized two-way comparison trees [footnote 9] are quite
straight forward.”

Secondly, [15, §6.5] gives code for the case of unrestricted queries, which it
describes as a “not difficult” modification of the preceding algorithm in §6.4.1.
Correctness is addressed only by the following remark: “The algorithm of the
previous section assumes that $\alpha_i = 0$ for all i. However, the improvement of
the algorithm to allow non-zero values of $\alpha$ is not difficult.” In contrast, Huang
et al. explicitly mention that they were unable to generalize their algorithm to
unrestricted queries.

8 In case (i), to minimize cost, the top two levels of the tree must contain D1 and two
other heavy keys from {A1, A2, A3, B4}. D1 has to be the right child of the root,
because otherwise a key no larger than B4 is, so the split key at the root has to be
no larger than B4, so the three light keys {C0, D0, E0} have to all be in the right
subtree, so that one of them has to be at level four instead of level three, increasing
the cost by at least 5. In case (ii), when D1 is deleted (not in the subtree), by a
similar argument, one of the light keys has to be at level four, so T' is best possible.

9 Spuler’s generalized two-way comparison trees are exactly 2wcsts as defined herein.


Neither of Spuler’s two proposed algorithms is published in a peer-reviewed
venue (although they are referred to in [14]). They have no correctness proofs,
and are based on [9], which we now know is incorrect. Given these considerations,
we judge that Anderson et al. [1] give the first correct proof of a poly-time
algorithm to find optimal 2wcsts when only successful queries are allowed
($\mathcal{K} = \mathcal{Q}$), and that this paper gives the first correct proof for the general case.


